I simulate GWAS \(p\)-values for 80,356 SNPs on chromosome 22 (with \(MAF>0.05\)), of which 14,234 are independent. For each of the 24 LD blocks, I use simGWAS to simulate results from 100 GWASs (varying the shared CVs).
The example shown below has 52 CVs:
I further define a set of “truly associated” SNPs as the CVs plus those SNPs with \(r^2>0.8\) with ANY of the causals (1227 in this example), and a set of “truly not-associated” SNPs as those with \(r^2<0.2\) with ALL the causals (75,552 in this example). I.e. the vast majority of the SNPs are “truly not-associated”.
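These truth sets can be sketched in R; the matrix `r2_cv` below is an illustrative stand-in for the real SNP-by-CV \(r^2\) values, not the actual simulation object (the column names `assoc`/`not_assoc` match those used in the results that follow):

```r
# Toy sketch: flag SNPs as truly (not-)associated from a matrix r2_cv of
# r^2 values between each SNP (rows) and each CV (columns).
set.seed(1)
r2_cv <- matrix(runif(20), nrow = 5)   # illustrative stand-in for real LD values
x <- data.frame(p = runif(5))
x$assoc     <- as.integer(apply(r2_cv, 1, max) > 0.8)  # r^2 > 0.8 with ANY CV
x$not_assoc <- as.integer(apply(r2_cv, 1, max) < 0.2)  # r^2 < 0.2 with ALL CVs
```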
# proportion of genome-wide significant SNPs that are truly associated
length(which(x$assoc==1 & x$p<=5*10^-8))/length(which(x$p<=5*10^-8))
## [1] 0.8534031
# proportion of genome-wide significant SNPs that are truly not-associated
length(which(x$not_assoc==1 & x$p<=5*10^-8))/length(which(x$p<=5*10^-8))
## [1] 0.005235602
# sensitivity: proportion of truly associated SNPs reaching significance
length(which(x$assoc==1 & x$p<=5*10^-8))/length(which(x$assoc==1))
## [1] 0.398533
# proportion of non-significant SNPs that are truly not-associated
length(which(x$not_assoc==1 & x$p>5*10^-8))/length(which(x$p>5*10^-8))
## [1] 0.9469311
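The four quantities above could be wrapped into a small helper for reuse across simulations; `perf_metrics` is my name for it (not from any package), and `x` is assumed to have columns `p`, `assoc` and `not_assoc` as above:

```r
# Helper computing the four performance quantities in one call.
perf_metrics <- function(x, alpha = 5e-8) {
  sig <- x$p <= alpha
  c(prop_sig_assoc     = mean(x$assoc[sig] == 1),      # precision
    prop_sig_not_assoc = mean(x$not_assoc[sig] == 1),  # false positives among sig.
    sensitivity        = mean(x$p[x$assoc == 1] <= alpha),
    specificity        = mean(x$not_assoc[!sig] == 1))
}
```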
I compare results with empirical cFDR for independent uniform q (no left-censoring or removal of L-curve segments).
Plotting v5 (v-vals after 5 iterations over independent uniform q) against the original p-values:
Based on my 110 simulations (each array job runs 11 simulations and takes 24 hrs; these results could also be presented as a table):
I’ll make a similar plot for specificity when my simulations have finished.
I now define:
I sample functionals from distribution A (rnorm(n, mean = sample(c(-2,-3),1), sd = sample(c(0.5,0.2), 1))) and nulls from distribution B (rnorm(n, mean = sample(c(2,3,4), 1), sd = sample(c(1, 1.5), 1))).
But I need to think about how I can iterate so I am not violating the independence assumption. Perhaps:
but this may still violate the independence assumption.
As above but sample nulls from distribution A with probability 0.9 and functionals from distribution A with probability 0.1.
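A minimal sketch of this sampling scheme; the function names `rA`, `rB` and `sample_q` are mine (not from any package), and the distributions follow the definitions above:

```r
# Distributions A and B as defined above (a mean/sd pair is drawn per call).
rA <- function(n) rnorm(n, mean = sample(c(-2, -3), 1), sd = sample(c(0.5, 0.2), 1))
rB <- function(n) rnorm(n, mean = sample(c(2, 3, 4), 1), sd = sample(c(1, 1.5), 1))

# Mixture variant: functionals draw from A w.p. 0.1, nulls from A w.p. 0.9
# (else from B), per the scheme described above.
sample_q <- function(is_functional) {
  p_A    <- ifelse(is_functional, 0.1, 0.9)
  from_A <- runif(length(is_functional)) < p_A
  q <- numeric(length(is_functional))
  q[from_A]  <- rA(sum(from_A))
  q[!from_A] <- rB(sum(!from_A))
  q
}
```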
I have run non-negative matrix factorisation on the annotation data using the NNLM package (the nmf package is too slow for matrices of this size). [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6945623/]
res_out <- nnmf(A = as.matrix(annots), k = 50, method = "scd", verbose = 1)
I regress the \(log(p)\) values against the NMF co-ordinates and choose dimensions to iterate over based on the t-values.
Dimension 23 is picked out:
regression_df <- data.frame(chisq = log(p_df$european_ancestry_pval_rand)[ind], coords[ind,])
lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
##
## Call:
## lm(formula = chisq ~ ., data = regression_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.512 -0.372 0.326 0.733 1.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.975e-01 9.993e-03 -89.810 < 2e-16 ***
## X1 4.265e-04 2.618e-03 0.163 0.870560
## X2 8.269e-04 9.294e-04 0.890 0.373600
## X3 -1.681e-03 1.092e-03 -1.538 0.123958
## X4 -2.056e-03 2.271e-03 -0.905 0.365307
## X5 -9.684e-03 1.807e-03 -5.358 8.42e-08 ***
## X6 -8.175e-03 5.916e-03 -1.382 0.167018
## X7 3.196e-03 3.293e-03 0.971 0.331779
## X8 -3.585e-02 4.211e-03 -8.513 < 2e-16 ***
## X9 -6.462e-02 7.093e-03 -9.110 < 2e-16 ***
## X10 2.286e-03 1.690e-03 1.352 0.176260
## X11 -1.305e-03 1.396e-03 -0.935 0.349956
## X12 -6.764e-04 3.519e-04 -1.922 0.054600 .
## X13 -1.548e-01 1.315e-02 -11.766 < 2e-16 ***
## X14 -2.868e-03 2.546e-03 -1.127 0.259845
## X15 3.266e-03 1.623e-03 2.013 0.044159 *
## X16 -1.049e-01 1.448e-02 -7.247 4.25e-13 ***
## X17 -1.139e-03 2.014e-03 -0.566 0.571567
## X18 -3.271e-02 7.007e-03 -4.668 3.05e-06 ***
## X19 2.390e-03 1.940e-03 1.232 0.217969
## X20 -2.056e-02 1.940e-03 -10.593 < 2e-16 ***
## X21 -2.559e-03 2.970e-03 -0.862 0.388840
## X22 -3.953e-04 1.248e-03 -0.317 0.751362
## X23 -6.048e-03 3.640e-04 -16.617 < 2e-16 ***
## X24 -4.153e-03 1.755e-03 -2.366 0.017976 *
## X25 -1.695e-02 3.347e-03 -5.065 4.08e-07 ***
## X26 -3.790e-02 3.882e-03 -9.764 < 2e-16 ***
## X27 2.022e-03 1.674e-03 1.208 0.227159
## X28 -9.915e-03 4.140e-03 -2.395 0.016612 *
## X29 -3.257e-02 3.230e-03 -10.083 < 2e-16 ***
## X30 -3.268e-04 6.792e-04 -0.481 0.630460
## X31 -2.700e-03 1.700e-03 -1.588 0.112303
## X32 -5.452e-02 4.334e-03 -12.581 < 2e-16 ***
## X33 -9.969e-03 2.476e-03 -4.026 5.67e-05 ***
## X34 -5.771e-02 4.989e-03 -11.568 < 2e-16 ***
## X35 -7.829e-03 2.064e-03 -3.793 0.000149 ***
## X36 -2.944e-02 2.500e-03 -11.777 < 2e-16 ***
## X37 1.455e-03 9.734e-05 14.943 < 2e-16 ***
## X38 7.593e-03 4.473e-03 1.697 0.089606 .
## X39 -2.389e-03 1.535e-03 -1.556 0.119700
## X40 -1.702e-04 6.289e-04 -0.271 0.786735
## X41 -3.483e-03 2.049e-03 -1.700 0.089221 .
## X42 7.219e-03 2.731e-03 2.643 0.008217 **
## X43 -3.081e-03 1.156e-03 -2.665 0.007704 **
## X44 5.589e-03 2.762e-03 2.023 0.043050 *
## X45 -2.681e-03 1.155e-03 -2.321 0.020287 *
## X46 1.210e-02 3.082e-03 3.927 8.60e-05 ***
## X47 4.787e-03 2.886e-03 1.659 0.097126 .
## X48 4.019e-03 3.400e-03 1.182 0.237196
## X49 5.680e-03 1.861e-03 3.051 0.002278 **
## X50 -2.262e-02 2.222e-03 -10.178 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.12 on 513061 degrees of freedom
## Multiple R-squared: 0.002451, Adjusted R-squared: 0.002354
## F-statistic: 25.21 on 50 and 513061 DF, p-value: < 2.2e-16
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
## (Intercept) X23 X37 X32 X36 X13
## 89.8102720 16.6166604 14.9426067 12.5805594 11.7768096 11.7658887
## X34 X20 X50 X29 X26 X9
## 11.5677220 10.5931708 10.1777566 10.0828724 9.7641775 9.1101797
## X8 X16 X5 X25 X18 X33
## 8.5130299 7.2474122 5.3580564 5.0653892 4.6677431 4.0261768
## X46 X35 X49 X43 X42 X28
## 3.9270036 3.7930278 3.0514003 2.6647892 2.6430475 2.3951889
## X24 X45 X44 X15 X12 X41
## 2.3661286 2.3210084 2.0232305 2.0125844 1.9220476 1.6995265
## X38 X47 X31 X39 X3 X6
## 1.6974834 1.6589515 1.5879284 1.5560350 1.5383735 1.3818517
## X10 X19 X27 X48 X14 X7
## 1.3523616 1.2319484 1.2077098 1.1820246 1.1267586 0.9705385
## X11 X4 X2 X21 X17 X30
## 0.9346764 0.9052997 0.8897520 0.8617245 0.5657452 0.4810802
## X22 X40 X1
## 0.3168444 0.2705533 0.1629472
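One caveat with this ordering: the intercept dominates the absolute t-values, so it should be dropped before picking the leading dimension. A self-contained toy sketch (made-up data, mirroring the workflow above):

```r
# Toy example: fit a regression, order covariates by |t|, drop the
# intercept, and take the top remaining dimension.
set.seed(1)
toy <- data.frame(y = rnorm(100), X1 = rnorm(100), X2 = rnorm(100))
fit <- lm(y ~ ., data = toy)
t_vals  <- summary(fit)$coefficients[, "t value"]
ordered <- sort(abs(t_vals), decreasing = TRUE)
top_dim <- setdiff(names(ordered), "(Intercept)")[1]
```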
But this only has a correlation of -0.005 with p…
I regress the \(log(p)\) values against the principal component co-ordinates and choose dimensions to iterate over based on the t-values.
regression_df <- data.frame(chisq = log(p)[ind], coords[ind,])
lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
##
## Call:
## lm(formula = chisq ~ ., data = regression_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.487 -0.372 0.326 0.733 1.976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.042e+00 1.916e-03 -543.968 < 2e-16 ***
## PC1 5.039e-03 6.052e-04 8.325 < 2e-16 ***
## PC2 -2.367e-02 1.270e-03 -18.637 < 2e-16 ***
## PC3 2.496e-03 1.721e-03 1.450 0.146959
## PC4 -9.077e-03 3.198e-03 -2.838 0.004538 **
## PC5 3.654e-02 8.196e-03 4.458 8.28e-06 ***
## PC6 3.603e-02 2.442e-03 14.756 < 2e-16 ***
## PC7 2.811e-02 6.427e-03 4.373 1.22e-05 ***
## PC8 -1.866e-02 3.232e-03 -5.772 7.83e-09 ***
## PC9 3.373e-02 4.844e-03 6.963 3.33e-12 ***
## PC10 2.083e-02 4.393e-03 4.741 2.13e-06 ***
## PC11 -3.833e-02 3.957e-03 -9.688 < 2e-16 ***
## PC12 4.832e-02 6.092e-03 7.932 2.16e-15 ***
## PC13 -7.389e-04 3.941e-03 -0.187 0.851270
## PC14 2.695e-04 5.349e-03 0.050 0.959818
## PC15 -2.243e-02 5.228e-03 -4.290 1.79e-05 ***
## PC16 5.787e-03 5.476e-03 1.057 0.290646
## PC17 3.211e-02 6.519e-03 4.925 8.44e-07 ***
## PC18 -1.150e-02 4.828e-03 -2.383 0.017182 *
## PC19 2.613e-02 5.730e-03 4.559 5.13e-06 ***
## PC20 2.474e-02 6.633e-03 3.730 0.000191 ***
## PC21 1.025e-03 5.579e-03 0.184 0.854191
## PC22 -8.892e-03 5.643e-03 -1.576 0.115042
## PC23 -2.849e-02 6.272e-03 -4.542 5.58e-06 ***
## PC24 -3.183e-02 6.551e-03 -4.858 1.18e-06 ***
## PC25 -1.628e-02 6.246e-03 -2.606 0.009151 **
## PC26 -4.215e-02 6.493e-03 -6.492 8.47e-11 ***
## PC27 -3.173e-02 6.270e-03 -5.061 4.18e-07 ***
## PC28 1.554e-03 6.009e-03 0.259 0.795969
## PC29 -2.021e-03 6.175e-03 -0.327 0.743473
## PC30 -5.353e-03 6.919e-03 -0.774 0.439111
## PC31 -1.052e-02 5.911e-03 -1.780 0.075129 .
## PC32 -2.631e-02 6.459e-03 -4.073 4.63e-05 ***
## PC33 2.226e-02 6.345e-03 3.509 0.000450 ***
## PC34 -2.889e-02 6.815e-03 -4.239 2.25e-05 ***
## PC35 -7.135e-05 6.931e-03 -0.010 0.991788
## PC36 1.859e-02 6.446e-03 2.884 0.003930 **
## PC37 -5.004e-02 7.159e-03 -6.990 2.74e-12 ***
## PC38 -5.470e-02 7.197e-03 -7.600 2.97e-14 ***
## PC39 1.220e-02 7.429e-03 1.642 0.100536
## PC40 -3.146e-03 6.192e-03 -0.508 0.611398
## PC41 -6.560e-03 7.207e-03 -0.910 0.362646
## PC42 1.944e-03 6.832e-03 0.285 0.775947
## PC43 -1.526e-02 7.884e-03 -1.935 0.052952 .
## PC44 1.650e-03 9.048e-03 0.182 0.855316
## PC45 -3.833e-02 7.869e-03 -4.871 1.11e-06 ***
## PC46 1.606e-02 8.370e-03 1.919 0.055021 .
## PC47 -7.412e-05 7.625e-03 -0.010 0.992244
## PC48 -2.617e-02 8.272e-03 -3.164 0.001558 **
## PC49 -6.774e-02 1.017e-02 -6.661 2.72e-11 ***
## PC50 -3.344e-02 1.068e-02 -3.132 0.001737 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.119 on 513061 degrees of freedom
## Multiple R-squared: 0.002908, Adjusted R-squared: 0.002811
## F-statistic: 29.93 on 50 and 513061 DF, p-value: < 2.2e-16
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
## (Intercept) PC2 PC6 PC11 PC1 PC12
## 543.96825951 18.63734441 14.75601531 9.68774084 8.32505501 7.93195145
## PC38 PC37 PC9 PC49 PC26 PC8
## 7.59991594 6.99044956 6.96322977 6.66139593 6.49220150 5.77226133
## PC27 PC17 PC45 PC24 PC10 PC19
## 5.06054861 4.92493341 4.87126841 4.85847391 4.74114729 4.55933143
## PC23 PC5 PC7 PC15 PC34 PC32
## 4.54189835 4.45787240 4.37339085 4.29022905 4.23860318 4.07344260
## PC20 PC33 PC48 PC50 PC36 PC4
## 3.73022971 3.50861524 3.16358446 3.13183006 2.88369979 2.83812933
## PC25 PC18 PC43 PC46 PC31 PC39
## 2.60638907 2.38279482 1.93531668 1.91871324 1.77967729 1.64226599
## PC22 PC3 PC16 PC41 PC30 PC40
## 1.57593214 1.45036000 1.05670698 0.91033486 0.77369530 0.50807895
## PC29 PC42 PC28 PC13 PC21 PC44
## 0.32725757 0.28460505 0.25856747 0.18749789 0.18377326 0.18233942
## PC14 PC35 PC47
## 0.05038217 0.01029303 0.00972038
Principal component 2 is picked out first:
But this only has a correlation of -0.004 with p…
regression_df <- data.frame(chisq = log(p_df$european_ancestry_pval_rand)[ind], coords[ind,1:50])
lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
##
## Call:
## lm(formula = chisq ~ ., data = regression_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.463 -0.372 0.326 0.733 1.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.041e+00 1.844e-03 -564.544 < 2e-16 ***
## dim.1 -7.927e-03 5.299e-04 -14.960 < 2e-16 ***
## dim.2 1.290e-04 7.392e-04 0.175 0.861448
## dim.3 -2.216e-03 8.290e-04 -2.673 0.007511 **
## dim.4 9.927e-05 8.789e-04 0.113 0.910065
## dim.5 -8.606e-03 8.646e-04 -9.954 < 2e-16 ***
## dim.6 9.369e-03 9.980e-04 9.388 < 2e-16 ***
## dim.7 9.483e-04 1.019e-03 0.930 0.352139
## dim.8 -1.024e-02 1.021e-03 -10.023 < 2e-16 ***
## dim.9 -7.218e-03 1.207e-03 -5.979 2.24e-09 ***
## dim.10 6.982e-03 1.110e-03 6.292 3.13e-10 ***
## dim.11 -1.682e-02 1.199e-03 -14.031 < 2e-16 ***
## dim.12 5.467e-03 1.183e-03 4.622 3.80e-06 ***
## dim.13 3.195e-03 1.202e-03 2.659 0.007836 **
## dim.14 -1.444e-02 1.217e-03 -11.867 < 2e-16 ***
## dim.15 -8.744e-03 1.277e-03 -6.848 7.47e-12 ***
## dim.16 -9.360e-04 1.354e-03 -0.691 0.489256
## dim.17 1.245e-03 1.378e-03 0.903 0.366280
## dim.18 -4.661e-04 1.356e-03 -0.344 0.730982
## dim.19 8.045e-04 1.430e-03 0.563 0.573774
## dim.20 -1.064e-03 1.685e-03 -0.631 0.527771
## dim.21 3.224e-03 1.636e-03 1.971 0.048708 *
## dim.22 2.503e-03 1.562e-03 1.602 0.109124
## dim.23 7.188e-06 1.597e-03 0.005 0.996407
## dim.24 -2.258e-03 1.540e-03 -1.466 0.142555
## dim.25 -2.710e-04 1.493e-03 -0.182 0.855951
## dim.26 -1.455e-03 1.449e-03 -1.004 0.315244
## dim.27 5.152e-04 1.395e-03 0.369 0.711793
## dim.28 -7.076e-03 1.354e-03 -5.227 1.72e-07 ***
## dim.29 -5.331e-04 1.340e-03 -0.398 0.690640
## dim.30 3.608e-03 1.442e-03 2.503 0.012323 *
## dim.31 5.673e-03 1.425e-03 3.980 6.89e-05 ***
## dim.32 4.833e-03 1.432e-03 3.375 0.000739 ***
## dim.33 -5.546e-03 1.435e-03 -3.863 0.000112 ***
## dim.34 1.047e-03 1.431e-03 0.731 0.464492
## dim.35 3.089e-03 1.441e-03 2.143 0.032106 *
## dim.36 -4.905e-04 1.454e-03 -0.337 0.735847
## dim.37 4.267e-04 1.437e-03 0.297 0.766443
## dim.38 2.691e-03 1.434e-03 1.876 0.060692 .
## dim.39 -8.609e-04 1.420e-03 -0.606 0.544350
## dim.40 -3.962e-03 1.415e-03 -2.800 0.005104 **
## dim.41 1.717e-03 1.423e-03 1.207 0.227563
## dim.42 4.741e-04 1.478e-03 0.321 0.748352
## dim.43 -2.994e-03 1.517e-03 -1.974 0.048436 *
## dim.44 1.459e-03 1.535e-03 0.950 0.341897
## dim.45 3.016e-03 1.525e-03 1.978 0.047961 *
## dim.46 -1.168e-03 1.545e-03 -0.756 0.449377
## dim.47 2.013e-03 1.557e-03 1.293 0.196058
## dim.48 -1.057e-03 1.553e-03 -0.681 0.496079
## dim.49 -1.792e-03 1.559e-03 -1.150 0.250324
## dim.50 -8.265e-03 1.510e-03 -5.473 4.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.12 on 513061 degrees of freedom
## Multiple R-squared: 0.002202, Adjusted R-squared: 0.002105
## F-statistic: 22.64 on 50 and 513061 DF, p-value: < 2.2e-16
# pick dimension with biggest absolute t statistic
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
## (Intercept) dim.1 dim.11 dim.14 dim.8 dim.5
## 5.645443e+02 1.495963e+01 1.403073e+01 1.186671e+01 1.002277e+01 9.954240e+00
## dim.6 dim.15 dim.10 dim.9 dim.50 dim.28
## 9.388487e+00 6.848461e+00 6.292132e+00 5.979472e+00 5.473031e+00 5.227158e+00
## dim.12 dim.31 dim.33 dim.32 dim.40 dim.3
## 4.622307e+00 3.980195e+00 3.863477e+00 3.374887e+00 2.800387e+00 2.673290e+00
## dim.13 dim.30 dim.35 dim.45 dim.43 dim.21
## 2.659058e+00 2.502761e+00 2.143097e+00 1.977717e+00 1.973531e+00 1.971139e+00
## dim.38 dim.22 dim.24 dim.47 dim.41 dim.49
## 1.875740e+00 1.602144e+00 1.466345e+00 1.292867e+00 1.206662e+00 1.149563e+00
## dim.26 dim.44 dim.7 dim.17 dim.46 dim.34
## 1.004281e+00 9.504247e-01 9.304491e-01 9.034641e-01 7.564546e-01 7.314712e-01
## dim.16 dim.48 dim.20 dim.39 dim.19 dim.29
## 6.914926e-01 6.806721e-01 6.314134e-01 6.062492e-01 5.625025e-01 3.979865e-01
## dim.27 dim.18 dim.36 dim.42 dim.37 dim.25
## 3.694496e-01 3.438194e-01 3.373585e-01 3.208131e-01 2.970309e-01 1.815306e-01
## dim.2 dim.4 dim.23
## 1.745319e-01 1.129572e-01 4.502561e-03
This looks to be the most promising:
But when we run the next regression to pick out the next most significant dimension, the relationship is non-monotonic (and we know that none of the cFDR methods can simultaneously decrease \(v\)-values for both low and high \(q\)).
##
## Call:
## lm(formula = chisq ~ ., data = regression_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.349 -0.375 0.328 0.733 1.501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0481724 0.0018385 -570.137 < 2e-16 ***
## dim.2 -0.0022101 0.0007426 -2.976 0.002921 **
## dim.3 0.0097650 0.0008305 11.758 < 2e-16 ***
## dim.4 0.0064289 0.0008829 7.281 3.31e-13 ***
## dim.5 -0.0227496 0.0008670 -26.240 < 2e-16 ***
## dim.6 0.0024284 0.0010021 2.423 0.015375 *
## dim.7 -0.0051504 0.0010232 -5.033 4.82e-07 ***
## dim.8 -0.0174233 0.0010261 -16.980 < 2e-16 ***
## dim.9 -0.0161535 0.0012117 -13.331 < 2e-16 ***
## dim.10 -0.0050019 0.0011137 -4.491 7.08e-06 ***
## dim.11 -0.0126495 0.0012045 -10.502 < 2e-16 ***
## dim.12 0.0035096 0.0011884 2.953 0.003145 **
## dim.13 0.0052779 0.0012071 4.372 1.23e-05 ***
## dim.14 -0.0254585 0.0012211 -20.850 < 2e-16 ***
## dim.15 -0.0154245 0.0012822 -12.030 < 2e-16 ***
## dim.16 -0.0045766 0.0013600 -3.365 0.000765 ***
## dim.17 0.0025280 0.0013848 1.826 0.067910 .
## dim.18 0.0018608 0.0013619 1.366 0.171855
## dim.19 0.0035365 0.0014370 2.461 0.013852 *
## dim.20 0.0003097 0.0016934 0.183 0.854897
## dim.21 0.0048991 0.0016435 2.981 0.002874 **
## dim.22 0.0053037 0.0015697 3.379 0.000728 ***
## dim.23 0.0004700 0.0016042 0.293 0.769522
## dim.24 -0.0012644 0.0015472 -0.817 0.413805
## dim.25 -0.0008798 0.0015002 -0.586 0.557566
## dim.26 -0.0021937 0.0014556 -1.507 0.131778
## dim.27 -0.0010790 0.0014013 -0.770 0.441316
## dim.28 -0.0065684 0.0013603 -4.829 1.37e-06 ***
## dim.29 -0.0017229 0.0013460 -1.280 0.200549
## dim.30 0.0067465 0.0014484 4.658 3.20e-06 ***
## dim.31 0.0062229 0.0014323 4.345 1.39e-05 ***
## dim.32 0.0021651 0.0014389 1.505 0.132415
## dim.33 -0.0088083 0.0014423 -6.107 1.02e-09 ***
## dim.34 0.0059362 0.0014376 4.129 3.64e-05 ***
## dim.35 0.0028960 0.0014482 2.000 0.045525 *
## dim.36 -0.0025305 0.0014610 -1.732 0.083263 .
## dim.37 -0.0018640 0.0014435 -1.291 0.196603
## dim.38 0.0034859 0.0014413 2.418 0.015586 *
## dim.39 -0.0061993 0.0014267 -4.345 1.39e-05 ***
## dim.40 -0.0026108 0.0014218 -1.836 0.066313 .
## dim.41 0.0033790 0.0014298 2.363 0.018116 *
## dim.42 -0.0002966 0.0014851 -0.200 0.841712
## dim.43 -0.0080105 0.0015243 -5.255 1.48e-07 ***
## dim.44 -0.0071324 0.0015414 -4.627 3.71e-06 ***
## dim.45 0.0090286 0.0015321 5.893 3.80e-09 ***
## dim.46 -0.0022935 0.0015521 -1.478 0.139493
## dim.47 0.0030492 0.0015647 1.949 0.051329 .
## dim.48 0.0005422 0.0015602 0.348 0.728189
## dim.49 -0.0025720 0.0015666 -1.642 0.100639
## dim.50 -0.0110954 0.0015173 -7.312 2.63e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.125 on 513062 degrees of freedom
## Multiple R-squared: 0.004647, Adjusted R-squared: 0.004552
## F-statistic: 48.89 on 49 and 513062 DF, p-value: < 2.2e-16
## (Intercept) dim.5 dim.14 dim.8 dim.9 dim.15
## 570.1366779 26.2395940 20.8495245 16.9804253 13.3311349 12.0295178
## dim.3 dim.11 dim.50 dim.4 dim.33 dim.45
## 11.7579051 10.5020654 7.3124673 7.2814326 6.1069464 5.8927547
## dim.43 dim.7 dim.28 dim.30 dim.44 dim.10
## 5.2551898 5.0333674 4.8286974 4.6577501 4.6271788 4.4913165
## dim.13 dim.39 dim.31 dim.34 dim.22 dim.16
## 4.3723686 4.3453439 4.3447590 4.1291304 3.3788782 3.3652654
## dim.21 dim.2 dim.12 dim.19 dim.6 dim.38
## 2.9809244 2.9759797 2.9532475 2.4610764 2.4234276 2.4184841
## dim.41 dim.35 dim.47 dim.40 dim.17 dim.36
## 2.3632491 1.9997722 1.9487252 1.8363028 1.8256055 1.7320637
## dim.49 dim.26 dim.32 dim.46 dim.18 dim.37
## 1.6417677 1.5071283 1.5046487 1.4776837 1.3662705 1.2912927
## dim.29 dim.24 dim.27 dim.25 dim.48 dim.23
## 1.2799920 0.8172175 0.7699732 0.5864616 0.3475355 0.2930006
## dim.42 dim.20
## 0.1997036 0.1828739
We downloaded GenoCanyon scores for the 1,978,016 SNPs in the Asthma data set, which represent the union of functional elements in different cell types. The GenoWAP software requires a “threshold” parameter defining functional loci according to the functional score, for which we used the default recommended value of 0.1; this corresponds to 40% of the SNPs in our data set being “functional”. We then ran the GenoWAP python script to obtain posterior scores for each SNP.
We also ran functional cFDR using the GenoCanyon scores as the auxiliary functional data.
Of the 1,978,016 SNPs in the smaller Asthma GWAS data set (our “principal trait”), 1,971,252 were also found in the larger UKBB Asthma data set.
For the 4269 SNPs that reached genome-wide significance in the larger UKBB GWAS, we compared the rank of the posterior scores from GenoWAP and the v-values from functional cFDR with the rank of the original \(p\)-value in the smaller Asthma GWAS data set. When using GenoWAP, 60.6% (2590/4269) had an improved or equal rank compared to 48.3% when using functional cFDR (2063/4269).
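The rank comparison can be sketched as follows; the data frame and column names are illustrative stand-ins for the real data (smaller rank = stronger signal, so posterior scores are negated before ranking):

```r
# Illustrative rank comparison for UKBB-significant SNPs: columns p
# (principal-trait p-value), posterior (GenoWAP) and v (cFDR v-value)
# are made-up stand-ins for the real data.
set.seed(1)
df <- data.frame(p = runif(10), posterior = runif(10), v = runif(10))
rank_p    <- rank(df$p)           # smaller p = better rank
rank_post <- rank(-df$posterior)  # larger posterior = better rank
rank_v    <- rank(df$v)           # smaller v = better rank
improved_genowap <- mean(rank_post <= rank_p)  # improved or equal rank
improved_cfdr    <- mean(rank_v    <= rank_p)
```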
The Venn diagram below shows the number of UKBB-significant SNPs that received an improved or equal rank after applying each of the two methods (the white area outside should be 1403).
Similarly, of the 1,966,983 SNPs that did not reach genome-wide significance in the larger UKBB GWAS, 60.6% had a decreased or equal rank after applying GenoWAP, compared to 30.6% when using functional cFDR. The overlap between these sets is shown in the Venn diagram below.
Whilst the pattern of the change in ranking is similar after applying the two methods, the rankings alter more when applying the GenoWAP method. Our method intrinsically determines the relevance of the auxiliary data and will reweight the \(p\)-values accordingly. In the GenoWAP example, the correlation between the \(p\)-values and the GenoCanyon scores (\(q\)) is very low (0.017) indicating that deriving a score that correlates with all GWAS results is a very difficult problem and that most “scores” of this type only capture a small proportion of the variability. This highlights the need for iterative methods that can leverage several layers of auxiliary data in a statistically robust framework.
Note: In FINDOR “To ensure a fair comparison to our approach, we ranked SNPs based on the resulting posterior probability and counted the number of independent GWAS loci in the top K SNPs, where K was the total number of genome-wide significant SNPs reported by FINDOR for the corresponding functional criteria.”
Also in GenoWAP: “We ranked the 298,321 SNPs in the NIDDK study based on their P-values and posterior scores, respectively. Then, within each of the 71 loci, we compared the rank of the lowest P-value to the rank of the largest posterior score. 56 out of 71 loci (79%) had an improved rank, 3 loci (4%) had an equal rank, while only 12 loci (17%) had a reduced rank (Supplementary Table S3). The probability of having an increased rank is significantly higher than that of having a decreased rank (P-value = 3.11 × 10^−8, one-sided binomial test).” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963360/pdf/btv610.pdf)
I think I need to compare at the locus level.
Comments and queries
“The primary metric of interest in both real and simulated data was the number of independent GWAS loci (at a level of p < 5 × 10−8) that the various methodologies identified. We conservatively define independent loci using PLINK’s LD-clumping algorithm with a 5 Mb window and an r2 threshold of 0.01.”
Useful links: https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html; https://www.cog-genomics.org/plink/1.9/postproc#clump; https://www.cog-genomics.org/plink/1.9/formats#assoc; https://www.cog-genomics.org/plink/1.9/formats#clumped
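Under the quoted criteria (5 Mb window, \(r^2<0.01\), \(p<5\times10^{-8}\)), a PLINK 1.9 clumping call might look like the following; the file names are placeholders:

```shell
# Define independent loci by LD clumping of the association results.
plink --bfile ref_panel \
      --clump gwas_results.assoc \
      --clump-p1 5e-8 \
      --clump-r2 0.01 \
      --clump-kb 5000 \
      --out independent_loci
```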
Need to think about a name for the method. The extension to continuous covariates could be called “continuous cFDR” (maybe ccFDR?), and then the application to using functional data with GWAS \(p\)-values could be “functional cFDR”?
R package: I’m thinking of calling it ccFDR, with only a couple of “all-in-one” functions (rather than James’ package, which has separate functions for L-curve generation and integration).